
Data Mining
Overview
Business Intelligence
Data Mining Defined
Knowledge discovery in databases
 Extracting implicit, previously
unknown information from large
volumes of raw data

Instances and Features
Typically, the database will be a
collection of instances
 Each instance will have values for a
given set of features
 From database theory:
instances are rows,
features are columns
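
As a small illustration of the rows-and-columns view, here is a sketch of a tiny table held in plain Python lists. The feature names and values are invented for this example and do not come from any real database.

```python
# Hypothetical loan-application table (all names and values are made up).
# Each row is an instance; each column is a feature.
feature_names = ["income", "loan_amount", "years_employed"]

instances = [
    [52000, 10000, 4],
    [31000, 15000, 1],
    [78000, 20000, 9],
]

# One instance is one row: a value for every feature.
first_instance = dict(zip(feature_names, instances[0]))

# One feature is one column: that feature's value across all instances.
income_index = feature_names.index("income")
income_column = [row[income_index] for row in instances]
```

Reading across a row gives everything known about one instance; reading down a column gives one feature for the whole collection.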

Classification
Supervised learning
 Suppose instances have been
categorized into classes and the
database includes this categorization
 Goal: using the “knowledge” in the
database, classify a given instance

Classifiers
[Diagram: feature values X1, X2, X3, …, Xn enter the Classifier, which outputs a category Y. The Classifier draws on a DB: a collection of instances with known categories.]

Classifier intelligence
A classifier’s intelligence will be based
on a dataset consisting of instances
with known categories
 Typical goal of a classifier: predict,
for a new instance, a category that is
rationally consistent with the dataset

BI Examples


A loan officer in a bank uses a system that
automatically approves or disapproves a
loan application based on previous loan
applications and decisions
 An admissions officer in a university uses a
system that automatically makes an
admission decision (accept, reject, wait-list)
based on previous applicants’ data and
the decisions made on them

Data mining method example:
k-nearest neighbors

For a given instance T, get the top k
database instances that are “nearest”
to T, using a reasonable distance
measure
 Inspect the categories of these k
instances and choose the category C that
represents the most instances
 Conclude that T belongs to category C
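
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production classifier: it assumes numeric features, uses Euclidean distance as one possible "reasonable" measure, and the names `knn_classify`, `euclidean`, and `loans` plus all data values are invented for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two feature vectors (one possible measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(database, target, k, distance=euclidean):
    """Return the most common category among the k instances nearest to target.

    database is a list of (feature_vector, category) pairs.
    """
    # Sort the database by distance to the target and keep the k closest.
    nearest = sorted(database, key=lambda item: distance(item[0], target))[:k]
    # Count how many of those k neighbors fall in each category.
    votes = Counter(category for _, category in nearest)
    return votes.most_common(1)[0][0]


# Made-up data: (income in thousands, years employed) -> past loan decision.
loans = [
    ((20, 1), "reject"),
    ((25, 2), "reject"),
    ((30, 1), "reject"),
    ((70, 8), "approve"),
    ((80, 10), "approve"),
    ((75, 6), "approve"),
]

decision = knn_classify(loans, (72, 7), k=3)
```

The sketch does nothing special about tied votes, and it does not rescale features; in practice a feature with a large numeric range would dominate the distance unless the features are normalized first.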

Clustering
Unsupervised learning
 Classes/categories are not known, but
unexpected groupings (clusters) are
discovered
 Clustering provides insight into the
population segments

Clustering
[Figure: Clustering; axes: Feature 1, Feature 2]

Goal of Clustering
Input: the database of instances, and
possibly some predetermined number
of clusters
 Output: the same database of
instances partitioned into clusters

BI Examples

After clustering the current university
student population, it was discovered that
there is a large group of female marketing
majors from a particular exclusive
school who tend to get high grades
 Business response: focus recruitment on
that school; promote the university’s
marketing program
 Customer segment characteristics and
spending patterns can direct business
strategies

Data mining method
example: k-means
Guess the number of clusters (k)
 Guess cluster centers from the
samples (these will be called
centroids)
 Determine cluster membership: each
instance joins the cluster of its
nearest centroid
 Repeatedly refine the centroids by
taking the average (mean) of the
members of each cluster, then
reassigning membership, until the
clusters stop changing
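
The k-means steps above can be sketched as follows. This is a minimal version under stated assumptions: numeric features, Euclidean distance, and the first k samples as the initial centroid guess (an arbitrary choice for the sketch). The function names and the sample points are invented for illustration.

```python
import math


def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def k_means(points, k, max_iterations=100):
    """Return (centroids, clusters) after refining until centroids stop moving."""
    # Initial guess: the first k samples (an assumption made for this sketch).
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # New centroid = mean of the members; keep the old one if a cluster is empty.
        updated = [
            mean_point(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Made-up two-feature data with two visually obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
```

The result depends on the initial guess; common practice is to run the procedure several times with different starting centroids and keep the best partition.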

Summary
Two sub-areas of data mining have
been discussed: supervised
(classification) and unsupervised
(clustering) learning methods
 For both types of methods, intelligent
systems can be created to support
business decision making
