Transcript Chapter 22

Chapter 22: Advanced Querying and
Information Retrieval
 Decision-Support Systems
 Data Analysis and OLAP
 Data Mining
 Data Warehousing
 Information-Retrieval Systems
1
Database System Concepts 4th Edition
22.1
©Silberschatz, Korth and Sudarshan
Decision Support Systems
 Decision-Support systems are used to make business
decisions often based on data collected by on-line transactionprocessing systems.
 Examples of business decisions:
 what items to stock?
 What insurance premium to change?
 Who to send advertisements to?
 Examples of data used for making decisions

Retail sales transaction details

Customer profiles (income, age, sex, etc.)
2
Database System Concepts 4th Edition
22.2
©Silberschatz, Korth and Sudarshan
Decision-Support Systems: Overview
 Data analysis tasks are simplified by specialized tools and SQL
extensions
 Example tasks
 For each product category and each region, what were the total sales
in the last quarter and how do they compare with the same quarter
last year
 As above, for each product category and each customer category
 Statistical analysis packages (e.g., : S++) can be interfaced with
databases
 Statistical analysis is a large field will not study it here
 Data mining seeks to discover knowledge automatically in the form
of statistical rules and patterns from Large databases.
 A data warehouse archives information gathered from multiple
sources, and stores it under a unified schema, at a single site.
 Important for large businesses which generate data from multiple
divisions, possibly at multiple sites
 Data may also be purchased externally
3
Database System Concepts 4th Edition
22.3
©Silberschatz, Korth and Sudarshan
Data Analysis and OLAP
 Aggregate functions summarize large volumes of data
 Online Analytical Processing (OLAP)
 Tools to support interactive analysis of data, allowing data to
be summarized and viewed in different ways
 A histogram partitions the values taken by an attribute into
ranges, and computes an aggregate over the values in
each range; cumbersome to use standard SQL to
construct a histogram. Extension proposed by Red Brick:
 select percentile, avg (balance)
 from account
 group by N_tile (balance, 10) as percentile
4
Database System Concepts 4th Edition
22.4
©Silberschatz, Korth and Sudarshan
Cross Tabulation of sales by item-name
and color
 The table above is an example of a cross-tabulation(or cross-
tab) also referred to as a pivot-table. In general, a cross-table is
a table where values for one attribute form the row headers,
values for another attribute form the column headers, and the
values in an individual cell are derived as follows.
 A cross tab with summary rows/columns can be represented by
introducing a special value all to represent subtotals.
Database System Concepts 4th Edition
22.5
5
©Silberschatz, Korth and Sudarshan
Relational Representation of the Data in
Figure 22.1
6
Database System Concepts 4th Edition
22.6
©Silberschatz, Korth and Sudarshan
Three-Dimensional Data Cube
7
Database System Concepts 4th Edition
22.7
©Silberschatz, Korth and Sudarshan
Online Analytical Processing
 The operation of changing the dimensions used in a cross-tab is
called pivoting.
 An OLAP system provides other functionality as well. For
instance, the analyst may wish to see a cross-tab on item-name
and color for a fixed value of size, for example, large, instead of
the sum across all sizes. Such an operation is referred to as
slicing. The operation is sometimes called dicing, particularly
when values for multiple dimensions are fixed.
 The operation of moving from finer-granularity data to a coarser
granularity is called a rollup.
 The opposite operation - that of moving from coarser-granularity
data to finer-granularity data – is called a drill down.
8
Database System Concepts 4th Edition
22.8
©Silberschatz, Korth and Sudarshan
OLAP Implementation
 The earliest OLAP systems used multidimensional arrays in
memory to store data cubes, and are referred to as
mutidimensional OLAP (MOLAP) systems.
 Hybrid systems, which store some summaries in memory and
store the base data and other summaries in a relational
database, are called hybrid OLAP (HOLAP) systems.
9
Database System Concepts 4th Edition
22.9
©Silberschatz, Korth and Sudarshan
Hierarchies on Dimensions
10
Database System Concepts 4th Edition
22.10
©Silberschatz, Korth and Sudarshan
Cross Tabulation of sales With Hierarchy
on item-name
11
Database System Concepts 4th Edition
22.11
©Silberschatz, Korth and Sudarshan
Data Analysis (Cont.)
Small
Medium
Large
Total
Light
Dark
8
20
35
10
310
5
53
35
Total
28
45
15
88
 Cross-tabulation of number by size and color of sample
relation sales with the schema Sales(color, size, number).
12
Database System Concepts 4th Edition
22.12
©Silberschatz, Korth and Sudarshan
Data Analysis (Cont.)
Color
Size
Number
Light
Light
Light
Light
Dark
Dark
Dark
Dark
all
all
all
all
Small
Medium
Large
all
Small
Medium
Large
all
Small
Medium
Large
all
8
35
10
53
20
10
5
35
28
45
15
88
 Can represent subtotals in relational form by using the value all
 E.g. : obtain (Light, all, 53) and (Dark, all, 35) by aggregating
individual tuples with different values for size for each color.
13
Database System Concepts 4th Edition
22.13
©Silberschatz, Korth and Sudarshan
Data Analysis (Cont.)
 Rollup: Moving from finer-granularity data to a coarser
granularity by means of aggregation.
 Drill down: Moving from coarser-granularity data finer-
granularity data.
 Proposed extensions to SQL, such as the cube operation
help to support generation of summary data
 The following query generates the previous table.
select color, size, sum (number)
from sales
groupby color, size with cube
14
Database System Concepts 4th Edition
22.14
©Silberschatz, Korth and Sudarshan
Data Analysis (Cont.)
 Figure shows the combinations of dimensions size, color, price
 In general computing cube operation with n groupby columns gives
2nd different groupby combinations.
15
Database System Concepts 4th Edition
22.15
©Silberschatz, Korth and Sudarshan
Data Mining
 Like knowledge discovery in artificial intelligence data mining
discovers statistical rules and patterns it differs from machine
learning in that it deals with large volumes of data stored
primarily on disk.
 Knowledge discovered from a database can be represented by
a set of rules.
 e.g.,: “Young women with annual incomes greater than $50,000
are most likely to buy sports cars”
 Discover rules using one of two models:
1. The user is involved directly in the process of knowledge
discovery.
2. The system is responsible for automatically discovering
knowledge from the database by detecting patterns and
correlation's in the data.
16
Database System Concepts 4th Edition
22.16
©Silberschatz, Korth and Sudarshan
Knowledge Representation Using Rules
 General form of rules  X antecedent  consequent
 X is a list of one or more variables with associated ranges.
 The rule  transactions T, buys (T, bread)  buys(T, milk)
states: if there is a tuple (t1, bread) in the relation buys, there
must also be a tuple (t1, milk) in the relation buys.
 Population: Cross-product of the ranges of the variables in
the rule.
 In the above example, the set of all transactions.
 Support: Measure of what fraction of the population satisfies
both the antecedent and the consequent of the rule.
 e.g., 10% of transactions buy bread and milk.
 Confidence : Measure of how often the consequent is true
when the antecedent is true.
 e.g., 80% of transactions that buy bread also buy milk
17
Database System Concepts 4th Edition
22.17
©Silberschatz, Korth and Sudarshan
Some Classes of Data-Mining Problems
 Classification : Finding rules that partition the given data
into disjoint groups (classes) that are relevant for making a
decision
 (e.g.,: which of several factors help classify a person’s credit
worthiness).
 Associations: Useful to determine associations between
different items
 (e.g.,: someone who buys bread is quite likely also to buy
milk).
 Sequence correlations: determine correlations between
related sequence data.
 (e.g., : when bond rates go up stock prices go down within
two days.)
18
Database System Concepts 4th Edition
22.18
©Silberschatz, Korth and Sudarshan
User-Guided Data Mining
 In user-guided data mining, primary responsibility for
discovering rules is with the user.
 User may runs tests on the database to verify or refute a
hypothesis. Confidence and support for rules expressing
a hypothesis are derived from the database.
 An iterative process of forming and refining rules is used.
Example: Test the hypothesis “People who hold master’s
degrees are the most likely to have an excellent credit
rating.” If confidence of rule is low, may refine it into the
rule:
 people P, P.degree = Masters and C.income  75, 000
 C.credit = excellent
 Data-visualization though graphical representations like
maps, charts, and color-coding, helps detect patterns in
data
19
Database System Concepts 4th Edition
22.19
©Silberschatz, Korth and Sudarshan
Classification Rules
 Classification rules help assign new objects to a set of classes.
E.g., given a new automobile insurance applicant, should he or
she be classified as low risk, medium risk or high risk?
 Classification rules for above example could use a variety of
knowledge, such as educational level of applicant, salary of
applicant, age of applicant, etc.
 Classification rules can be compactly shown as a Classification
tree.
20
Database System Concepts 4th Edition
22.20
©Silberschatz, Korth and Sudarshan
Part of Credit Risk Classification Tree
21
Database System Concepts 4th Edition
22.21
©Silberschatz, Korth and Sudarshan
Discovery of Classification Rules
 Training set: a data sample in which the grouping for each
tuple is already known.
 Top down generation of classification tree.
 Each internal node of the tree partitions the data into groups
based on the attribute.
 The data at a node is not partitioned further if either all (or most)
of the items at the node belong to the same class, or all
attributes have been considered. Such a node is a leaf node.
 Otherwise the data at the node is partitioned further by picking
an attribute for partitioning data at the node.
22
Database System Concepts 4th Edition
22.22
©Silberschatz, Korth and Sudarshan
Discovery of Classification Rules (Cont.)
 Consider credit risk example: Suppose degree is chosen
to partition the data at the root. Since degree has a small
number of possible values, one child is created for each
value.
 At each child node of the root, further classification is
done tuple if required. Here, partitions are defined by
income. Since income is a continuous attribute, some
number of intervals are chosen, and one child created for
each interval.
 Different classification algorithms use different ways of
choosing which attribute to partition on at each node, and
what the intervals, if any, are.
 In general, different branches of the tree could grow to
different levels. Different nodes at the same level may use
difficult partitioning attributes.
23
Database System Concepts 4th Edition
22.23
©Silberschatz, Korth and Sudarshan
Discovery of Association Rules
 Example:  transactions T, buys (T, bread)  buys (T, milk)
 In general: notion of transaction , and its intemset, the set of items
contained in the transaction
 General form of rule:
 transactions T, c(T, i1) and . . . and c(T, io)  c(T, i0)
where c(T, ik) denotes that transaction T contains item ik.
Above can be represented as A  b where A = {i1, i2, . . . , in}
and b = io.
 Support of rule = number of transactions whose itemsets contain
A  {b}
 Usually desire rules with strong support, which will involve only
items purchased in a significant percentage of the transactions.
24
Database System Concepts 4th Edition
22.24
©Silberschatz, Korth and Sudarshan
Discovery of Association Rules (Cont.)
 Consider all possible sets of relevant items.
 For each set find its support (i.e. , how many transactions
purchase all items in the set).
 Use sets with sufficiently high support to generate
association rules.
 From set A generate the rules A - {b} b for each b  A.
 Support of each of the rules is support of A.
 Confidence of a rule is support of A divided by support of
A - {b}.
25
Database System Concepts 4th Edition
22.25
©Silberschatz, Korth and Sudarshan
Finding Support
 Few sets: Determine level of support via a single pass.
 A count is maintained for each set, initially set to 0.
 When a transaction is fetched, the count is incremented for each
set of items which contained in the itemset of the transaction.
 Sets with a high count at the end of the pass correspond to items
with a high degree of association.
 Many sets: If memory not enough to hold all counts for all sets
Use multiple passes, considering only some sets in each
pass.
 Optimization: Once a set is eliminated because it occurs in too
small a fraction of the transactions, none of its supersets
needs to be considered.
26
Database System Concepts 4th Edition
22.26
©Silberschatz, Korth and Sudarshan
Data Warehousing
 A data warehouse is a repository of information gathered
from multiple sources.
27
Database System Concepts 4th Edition
22.27
©Silberschatz, Korth and Sudarshan
Data Warehousing (Cont.)
 Provides a single consolidated interface to data
 Data stored for an extended period, providing access to
historical data
 Data/updates are periodically downloaded form online
transaction processing (OLTP) systems.
 Typically, download happens each night.
 Data may not be completely up-to-date, but is recent enough for
analysis.
 Running large queries at the warehouse ensures that OLTP
systems are not affected by the decision-support workload.
28
Database System Concepts 4th Edition
22.28
©Silberschatz, Korth and Sudarshan
Issues in Building a Warehouse
 When and how to gather data.
 Source driven: data source initiates data transfer
 Destination driven: warehouse initiates data transfer
 What schema to use.
 Schema integration
 Cleaning and conversion of incoming data
 What data to summarize.
 Raw data may be too large to store on-line
 Aggregate values (totals/subtotals) often suffice
 Queries on raw data can often be transformed by query optimizer to
use aggregate values
 How to propagate updates.
 Date at warehouse is a view on source data
 Efficient view maintenance techniques required
29
Database System Concepts 4th Edition
22.29
©Silberschatz, Korth and Sudarshan
Information Retrieval Systems
 Information retrieval (IR) systems use a simpler data
model than database systems, but provide more
powerful querying capabilities within the restricted
model.
 Queries attempt to locate documents that are of interest
by specifying, for example, sets of keywords.
 e.g., find documents containing the words “database
systems”
 Information retrieval systems order answers based on
their estimated relevance.
 e.g., user may really only want documents about
database systems, but the system may retrieve all
documents that mention the phrase database systems”.
 Documents may be ordered by, for example, how many
times the phrase appears in the document.
30
Database System Concepts 4th Edition
22.30
©Silberschatz, Korth and Sudarshan
Queries
 Combinations of keywords
 motorcycle and maintenance
 computer or micro-processor
 computer but not database
 Closeness of keyword s in the and case affects ranking.
Some systems allow user to specify that the keywords must
occur close to each other.
 Synonyms
 To retrieve document title motorcycle repair for the query
motorcycle and maintenance, need to realize that maintenance
and repair are synonyms
 Similarity based retrieval - retrieve documents similar to a
given document. Similarity may be defined based on metrics
such as number of common keywords.
31
Database System Concepts 4th Edition
22.31
©Silberschatz, Korth and Sudarshan
Differences From Database Systems
 Information retrieval systems, unlike traditional database
systems, handle:
 Unstructured documents
 Searching using keywords and relevance ranking
 Most information retrieval systems do not handle:
 High update rates
 Concurrency control
 Data structured using more complex data models (e.g., relational
or object oriented data models)
 Complex queries written in, e.g., SQL
32
Database System Concepts 4th Edition
22.32
©Silberschatz, Korth and Sudarshan
Indexing of Documents
 Documents that contain a specified keyword can be
located using an inverted index which maps each
keyword Ki to the set Si of identifiers of documents that
contain Ki .
 IR systems save space by using index structures that
support only approximate retrieval. May result in:
 false drop - some relevant documents may not be
retrieved.
 false positive - some irrelevant documents may be
retrieved.
 For many applications a good index should not permit any
false drops, but may permit a few false positives.
 Relevant performance metrics:
 Precision - what percentage of the retrieved documents are
relevant to the query.
 Recall - what percentage of the documents relevant to the
query were retrieved.
Database System Concepts 4th Edition
22.33
33
©Silberschatz, Korth and Sudarshan
Indexing of Documents (Cont.)
 and operation: Finds documents that contain all of a set of
keywords K1, K2, ..., Kn.
 Retrieve the corresponding sets of identifiers of documents
S1, S2, ... Sn.
 The intersection, S1 S2 .....  Sn, gives the identifiers of the
desired set of documents.
 or operation: Gives the set of all documents that contain at least
one of the keywords K1, K2, …, Kn
 Found by computing the union, S1 S2 .....  Sn, of the sets.
34
Database System Concepts 4th Edition
22.34
©Silberschatz, Korth and Sudarshan
Indexing of Documents (Cont.)
 not operation: Finds documents that do not contain a
specified keyword Ki
 Let Si be the set of identifiers of documents that contain the
keyword Ki.
 Given a set of document identifies S, eliminate documents
that contain the specified keyword Ki by taking the difference
S-Si
 A full-text index uses every work in the document as a
keyword.
 Stop words are very commonly occurring words that are
useless as key words, e.g, a, an, the, it etc. These are
eliminated from the index.
35
Database System Concepts 4th Edition
22.35
©Silberschatz, Korth and Sudarshan
Browsing
 Storing related documents together facilitates browsing,
where a user can see not only requested document but
also related ones.
 Browsing in a library facilitated by classification system
that organizes logically related books together.
36
Database System Concepts 4th Edition
22.36
©Silberschatz, Korth and Sudarshan
A Classification Tree
37
Database System Concepts 4th Edition
22.37
©Silberschatz, Korth and Sudarshan
Classification DAG
 Documents can reside in multiple places in a hierarchy
in an information retrieval system, since physical
location is not important.
 Classification hierarchy is thus Directed Acyclic Graph
(DAG).
38
Database System Concepts 4th Edition
22.38
©Silberschatz, Korth and Sudarshan
A Classification DAG for a library
information retrieval system
39
Database System Concepts 4th Edition
22.39
©Silberschatz, Korth and Sudarshan
Locating of Information on the Web
 The Archie system automatically follows Gopher links to
locate information, and creates a centralized index of
information found various sites.
 Web indexing systems (Web crawlers) follow the
hypertext links in documents to find other documents,
and build an index on the documents.
 The indices are often full-text indices, and are stored
locally at the indexing system.
 These systems run a background process to
 find new sites.
 obtain updated information from known sites.
 discard defunct sites.
40
Database System Concepts 4th Edition
22.40
©Silberschatz, Korth and Sudarshan
Locating of Information on the Web (Cont.)
 Web indexing systems permit documents to be located
even though they are not registered with any central
authority.
 Drawback: poor precision of recall
 Full text indexing retrieves unrelated documents that just
happen to mention the requested keyword
 HTML extensions now allow documents to be tagged with
keywords to be used by search engines; unfortunately,
many documents do not provide such keywords.
 The extremely large number of documents on the Web
often leads to far more results than a human can handle.
 Alternative approach: a cataloging system for the Web,
such as that provide by Yahoo
 Combinations of catalogs and indexing are quite useful
Provided by, e.g., Yahoo (www.yahoo.com).
41
Database System Concepts 4th Edition
22.41
©Silberschatz, Korth and Sudarshan
Classification Tree
42
Database System Concepts 4th Edition
22.42
©Silberschatz, Korth and Sudarshan
Data-Warehouse Architecture
43
Database System Concepts 4th Edition
22.43
©Silberschatz, Korth and Sudarshan
Star Schema For A Data Warehouse
44
Database System Concepts 4th Edition
22.44
©Silberschatz, Korth and Sudarshan
A Classification Hierarchy For A Library System
45
Database System Concepts 4th Edition
22.45
©Silberschatz, Korth and Sudarshan
A Classification DAG For A Library
Information Retrieval System
46
Database System Concepts 4th Edition
22.46
©Silberschatz, Korth and Sudarshan
Data Analysis and OLAP
 Online Analytical Processing
 Data that can be modeled as dimension attributes and measure
attributes are called multidimensional data.
 Given a relation used for data analysis, we can identify some of its
attributes as measure attributes, since they measure some value,
and can be aggregated upon. For instance, the attribute number of
the sales relation is a measure attribute, since it measures the
number of units sold.
 Some of the other attributes of the relation are identified as
dimension attributes, since they define the dimensions on which
measure attributes, and summaries of measure attributes, are
viewed.
47
Database System Concepts 4th Edition
22.47
©Silberschatz, Korth and Sudarshan
Extended Aggregation
 SQL:1999 also supports generalizations of the group by constructs,
using the cube and rollup constructs. A representative use of the
cube construct is;
select item-name, color, size, sum(number)
from sales
group by cube(item-name, color, size)
 This query computes the union of eight different groupings of the
sales relation:
{ (item-name, color, size), (item-name, color), (item-name, size),
(color, size), (item-name), (color), (size), ( ) }
Where ( ) denotes an empty group by list.
 For each grouping, the result contains the null value for attributes not
present in the grouping. For instance, with occurrences of all replaced
by null, can be computed by the query
select item-name, color, sum(number)
from sales
group by cube(item-name, color)
48
Database System Concepts 4th Edition
22.48
©Silberschatz, Korth and Sudarshan
Extended Aggregation (Cont.)
 A representative rollup construct is
select item-name, color, size, sum(number)
from sales
group by rollup(item-name, color, size)
 Here only four grouping are generated:
{ (item-name, color, size), (item-name, color), (item-name), ( ) }
 Rollup can be used to generate aggregates at multiple levels of a
hierarchy on a column. For instance, we have a table
itemcategory(item-name, category) giving the category of each
item. Then the query
select category, item-name, sum(number)
from sales, category
where sales.item-name = itemcategory.item-name
group by rollup(category, item-name)
would give a hierarchical summary by item-name and by category.
49
Database System Concepts 4th Edition
22.49
©Silberschatz, Korth and Sudarshan
Extended Aggregation (Cont.)
 Multiple rollups and cubes can be used in a single group by clause.For
instances, the following query
select item-name, color, size, sum(number)
from sales
group by rollup(item-name), rollup(color, size)
generates the groupings
{ (item-name, color, size), (item-name, color), (item-name), (color, size),
(color), ( ) }
 The function grouping can be applied on an attribute; it returns 1 if the value
is a null value representing all, and returns 0 in all other cases. Consider the
following query:
select item-name, color, size, sum(number),
grouping(item-name) as item-name-flag,
grouping(color) as color-flag,
grouping(size) as size-flag,
from sales
group by cube(item-name, color, size)
50
Database System Concepts 4th Edition
22.50
©Silberschatz, Korth and Sudarshan
Ranking
 Ranking is done in conjunction with an order by specification.
Suppose we are given a relation student-marks(student-id, marks)
which stores the marks obtained by each student. The following query
gives the rank of each student.
select student-id, rank( ) over (order by (marks) desc as s-rank
from student-marks
 An extra order by clause is needed to get them in sorted order, as
shown below.
select student-id, rank ( ) (order by (marks) desc) as s-rank
from student-marks order by s-rank
 Ranking can be done within partition of the data. The following query
then gives the rank of students within each section.
select student-id, section,
rank ( ) over (partition by section order by marks desc) as
sec-rank
from student-marks, student-section
where student-marks.student-id = student-section.student-id
order by section, sec-rank
51
Database System Concepts 4th Edition
22.51
©Silberschatz, Korth and Sudarshan
Ranking (Cont.)
 For a given constant n, the ranking the function ntile(n) takes the
tuples in each partition in the specified order, and divides them
into n buckets with qual numbers of tuples. For instance, we an
sort employees by salary, and use ntile(3) to find which range
(bottom third, middle third, or top third) each employee is in, and
compute the total salary earned by employees in each range:
select threetile, sum(salary)
from (
select salary, ntile(3) over (order by (salary) as threetile
from employee) as s
group by threetile
 SQL:1999 permits the user to specify where they should occur
by using nulls first or nulls last, for instance
select student-id, rank ( ) over (order by marks desc nulls
last) as s-rank
from student-marks
52
Database System Concepts 4th Edition
22.52
©Silberschatz, Korth and Sudarshan
Windowing
 An example of window query is that, given sales values for each
date, calculates for each date the average of the sales on that
day, the previous day, and the next day; such moving average
queries are used to smooth out random variations.
 In contrast to group by, the same tuple can exist in multiple
windows. Suppose we are given a relation transaction(accountnumber, date-time, value), where value is positive for a deposit
and negative for a withdrawal, consider a query
select account-number, date-time,
sum(value) over
(partition by account-number
order by date-time
rows unbounded preceding)
as balance
from transaction
order by account-number, date-time
53
Database System Concepts 4th Edition
22.53
©Silberschatz, Korth and Sudarshan
Decision - Tree Construction
 The main idea of decision tree construction is to evaluate different attributes
and different partitioning conditions and pick the attribute and partitioning
condition that results in the maximum information gain ratio.
Procedure Grow.Tree(S)
Partition(S);
Procedure Partition (S)
if (purity(S) > p or |S| < s) then
return;
for each attribute A
evaluate splits on attribute A;
Use best split found (across all attributes) to partition
S into S1, S2, …., Sr,
for i = 1, 2, ….., r
Partition(Si);
 We can generate classification rules from a decision tree, if we so desire. For
each leaf we generate a rule as follows: The left hand side is the conjunction
of all the split conditions on the path to the leaf, and the class is the class of
the majority of the training instances at the leaf. An example of such a
classification rule is
degree = masters and income > 75,000  excellent
54
Database System Concepts 4th Edition
22.54
©Silberschatz, Korth and Sudarshan
Other Types of Classifiers
 There are two types of classifiers
 Neural net classifiers
 Bayesian classifiers
 Neural net classifiers use the training data to train artificial neural nets.
 To find the probability p(cj|d) of instance d being in class cj, Bayesian
classifiers use Baye’s theorem, which says
p(cj|d) = p(d|cj)p(cj)
p(d)
where p(d|cj) is the probability of generating instance d given class cj, p(cj) is
the probability of occurrence of class cj, and p(d) is the probability of
instanced occuring.
 To simplify the task, naïve Bayesian classifiers assume attributes have
independent distributions, and thereby estimate
p(d|cj) = p(d1|cj) * p(d2|cj) * ….* (p(dn|cj)
That is, the probability of the instance d occuring is the product of the
probability of occurrence of each of the attribute values di of d, given the
class cj.
55
Database System Concepts 4th Edition
22.55
©Silberschatz, Korth and Sudarshan
Regression
 Regression deals with the prediction of a value, rather than a
class. Given values for a set of variables, X1, X2, …, Xn, we
wish to predict the value of a variable Y.
 One way is to infer coefficients a0, a1, a1, …, an such that
Y = a0 + a1 * X1 + a2 * X2 + … + an * Xn
 Finding such a linear polynomial is called linear regression. The
process to find a curve that fits the data is called curve fitting.
 The fit may only be approximate, because of noise in the data or
because the relationship is not exactly a polynomial, so
regression aims to find coefficients that give the best possible fit.
56
Database System Concepts 4th Edition
22.56
©Silberschatz, Korth and Sudarshan
Association Rules
 Retail shops are often interested in associations between
different items that people buy. Examples of such associations
are:
 Someone who buys bread is quite likely also to buy milk
 A person who bought the book Database System Concepts is quite
likely also to buy the book Operating System Concepts.
 Associations information can be used in several ways. When a
customer buys a particular book, an online shop may suggest
associated books.
 Association Rules
An example of Association Rules is
bread  milk
An association rule must have an associated population; the
population consists of a set of instances.
57
Database System Concepts 4th Edition
22.57
©Silberschatz, Korth and Sudarshan
Association Rules (Cont.)
 Association Rules: An example of Association Rules is
bread  milk
 An association rule must have an associated population; the population
consists of a set of instances.
Rules have an associated support, as well as an associated confidence.
These are defined in the context of population:
 Support is a measure of what fraction of the population satisfies both the
antecedent and the consequent of the rule.For instance, suppose only
0.001 percent of all purchases include milk and screwdrivers. The support
for the rule is milk  screwdrivers is low.
 Confidence is a measure of how often the consequent is true when the
antecedent is true. For instance, the rule bread  milk has a confidence
of 80 percent if 80 percent of the purchases that include bread also
include milk.A rule with a low confidence is not meaningful.
Note that the confidence of bread  milk may be very different from the
confidence of milk  bread, although both have the same supports.
58
Database System Concepts 4th Edition
22.58
©Silberschatz, Korth and Sudarshan
Clustering
59
Database System Concepts 4th Edition
22.59
©Silberschatz, Korth and Sudarshan
Other Types of Mining
60
Database System Concepts 4th Edition
22.60
©Silberschatz, Korth and Sudarshan
Data Warehousing
61
Database System Concepts 4th Edition
22.61
©Silberschatz, Korth and Sudarshan
Components of Data Warehouse
 When and how to gather data.
 What schema to use.
 Data cleansing
 How to propagate updates.
 What data to summarize
62
Database System Concepts 4th Edition
22.62
©Silberschatz, Korth and Sudarshan
Warehouse Schemas
63
Database System Concepts 4th Edition
22.63
©Silberschatz, Korth and Sudarshan
Information-Retrieval Systems
64
Database System Concepts 4th Edition
22.64
©Silberschatz, Korth and Sudarshan
Keyword Search
 Information-retrieval systems typically allow query expressions
formed using keywords and the logical connectives and, or, and
not.
 A query containing keywords without any of the above
connectives is assumed to have ands implicitly connecting the
keywords
 In full text retrieval, all the words in each document are
considered to be keywords. We shall use the word term to refer
to the words in a document, since all words are keywords.
65
Database System Concepts 4th Edition
22.65
©Silberschatz, Korth and Sudarshan
Relevance Ranking Using Terms
 One way of measuring r(d, t), the relevance of a document d to a
term t, is
r(d, t) = log
n(d, t)
1+
n(d)
Where n(d) denotes the number of terms in the document and
n(d, t) denotes the number of occurrences of term t in the
document d.
r(d, t)
r(d, Q) =  n(t)
tQ
66
Database System Concepts 4th Edition
22.66
©Silberschatz, Korth and Sudarshan
Relevance Using Hyperlinks
67
Database System Concepts 4th Edition
22.67
©Silberschatz, Korth and Sudarshan
Similarity-Based Retrieval
68
Database System Concepts 4th Edition
22.68
©Silberschatz, Korth and Sudarshan
Synonyms and Homonyms
69
Database System Concepts 4th Edition
22.69
©Silberschatz, Korth and Sudarshan
Indexing of Documents
70
Database System Concepts 4th Edition
22.70
©Silberschatz, Korth and Sudarshan
Measuring Retrieval Effectiveness
71
Database System Concepts 4th Edition
22.71
©Silberschatz, Korth and Sudarshan
Web Search Engines
72
Database System Concepts 4th Edition
22.72
©Silberschatz, Korth and Sudarshan
Directories
73
Database System Concepts 4th Edition
22.73
©Silberschatz, Korth and Sudarshan
74
Database System Concepts 4th Edition
22.74
©Silberschatz, Korth and Sudarshan
Applications of Data Mining
 person P, P.degree = masters and P.income > 75,000 
P.credit – excellent
person P, P.degree = bachelors or (p.income  25,000 and
P.income  75,000)  P.credit = good
75
Database System Concepts 4th Edition
22.75
©Silberschatz, Korth and Sudarshan
Best Splits
 The purity of a set S of training instances can be measured quantitatively in
several ways. Suppose there are k classes, and of the instances in S the
fraction of instances in class I is pi. One measure of purity, the Gini measure is
defined as
k
[
Gini (S) = 1 - 
p2i
i- 1
 When all instances are in a single class, the Gini value is 0, while it reaches its
maximum (of 1 –1 /k) if each class the same number of instances. Another
measure of purity is the entropy measure, which is defined as
k
Entropy (S) = - 
pilog2pi
i- 1
 When a set S is split into multiple sets Si, I=1, 2, …, r, we can measure the
purity of the resultant set of sets as:
r
Purity(S1, S2, ….., Sr) = 
|Si|
i= 1 |S|
purity (Si)
The above formula can be used with both the Gini measure and the entropy
76
measure of purity.
Database System Concepts 4th Edition
22.76
©Silberschatz, Korth and Sudarshan
Best Splits (Cont.)
 The information gain due to particular split of S into Si, I = 1, 2,
…., r is then
Information-gain (S,{S1, S2, …., Sr) = purity(S) – purity (S1, S2, …
Sr)
 The information content of a particular split can be defined in
terms of entropy as
r
Information-content(S, {S1, S2, ….., Sr})) = - 
|Si|
i- 1 |S|
log2
|Si|
|S|
 All of this leads to a definition: The best split for an attribute is the
one that gives the maximum information gain ratio, defined as
Information-gain (S{S1, S2, ……, Sr})
Information-content (S, {S1, S2, ….., Sr})
77
Database System Concepts 4th Edition
22.77
©Silberschatz, Korth and Sudarshan